Add some additional information to customize the knitted document:

date: "October 01, 2020"
output:
  html_document:
    number_sections: yes
    theme: cerulean
    toc: yes
    toc_depth: 5
    toc_float: yes
  pdf_document:
    toc: yes
    toc_depth: '5'

This will add a table of contents (toc) and will change the colors (theme: cerulean)

To find your favorite Rmarkdown theme: https://www.datadreaming.org/post/r-markdown-theme-gallery/

knitr::opts_chunk$set(cache=TRUE, fig.path='figures/', fig.width=8, fig.height=5 )

This saves all figures in the directory figures and sets the default figure size

1 R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Rmarkdown Cheatsheet: https://rmarkdown.rstudio.com/lesson-15.html

“#” hash signs indicate headers.

The number of hashes equals the header level.

1.1 h2

1.1.1 h3

1.1.2 h4

placing a single asterisk on either side of a phrase makes it italic.

double asterisks make a word or phrase bold.

triple asterisks make a word or phrase bold and italic.

  • a single asterisk at the beginning of a line makes a bullet
  1. and a number at the begining of a line creates a numbered item.
  2. this should add

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Execute this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

1.2 Including Plots & Images

You can also embed plots, for example:

(Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.)

echo =FALSE will only display the output, not the code.

Some more chunk options: * Use echo=FALSE to avoid having the code itself shown. * Use results="hide" to avoid having any results printed. * Use eval=FALSE to have the code shown but not evaluated. * Use warning=FALSE and message=FALSE to hide any warnings or messages produced. * Use fig.height and fig.width to control the size of the figures produced (in inches).

naming chunks = good practice (the above chunk was named pressure) * helps navigate around the document & this is what the figures will be named

(check the Rproject directory after knitting)

You can also include images from your local computer or from the web:

#!

1.3 Adding tables

Can type out tables:

col name
1 1 1
2 2 2

Alternatively, you can use the knitr package to make mardown tables from data frames:

speed dist
4 2
4 10
7 4
7 22
8 16
9 10

left, right, center adjust

1.4 Knitting

When you knit the file, an HTML file containing the code and output will be saved alongside it (click the Knit button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor (Viewer tab).

2 Rprojects

Rproject Benefits:

  • No need to set the working directory. All paths are relative to the directory containing the Rproject.

    Whenever you open your project, the working directory is automatically set to where your project is. This means your code will not break when you work on a different computer.

  • RStudio projects allow you to open multiple projects at the same time with each open to its own project directory. This allows you to keep multiple projects open without them interfering with each other.

Good organization / project lay out will:

  • ensure the integrity of your data
  • make it easier to collaborate
  • make it easier to a pick a project back up after a break

Project Management tips:

  • treat raw data as “read only”
  • create separate directory for “cleaned data” (or don’t save altered data files - will see later on with dplyr) - results
  • generated output is disposable (because your analysis is reproducible!)
  • put scripts in src directory
  • name all files to reflect their content or function (e.g. fig1_pca_communitycomposition.jpg not Rplot1.jpg)
  • avoid duplication - as code for a project matures, you will want to start splitting out functions into separate scripts. These scripts might be useful across multiple projects. When reusing a script, use a symbolic link to save space on your computer and avoid having to update a file in multiple places. Data that is reused can also be symbolically linked (ln -s)

data for this workshop

following good project management practices, make a new directory called data and download the data we will be playing with in this workshop into that directory:

In terminal tab:

mkdir data

cd data

wget https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv

curl

We will use the data later, but we can get a general sense of the data by looking at it in the terminal, which will help us decide how to load it into R later:

wc -l gapminder_data.csv

head gapminder_data.csv

cd -

3 GitHub

go to your GitHub account and make a new repository DO NOT initialize with a README

follow the instructions on the next page

(in terminal tab)

echo "# SkillPill_ReproducibleR" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/maggimars/SkillPill_ReproducibleR.git
git push -u origin master

README.md is a markdown file, just like this Rmarkdown file in many ways- uses similar syntax.

try also adding your data directory to your Github repository!

Alternatively - you can use the Rstudio interface to version control with Git https://swcarpentry.github.io/git-novice/14-supplemental-rstudio/

(I prefer command line)

4 A few notes on getting help

?function_name

If you can’t really remember a function name ??function_name

pro-tip From within the function help page, you can highlight code in the Examples and hit Ctrl+Return to run it in RStudio console. This is gives you a quick way to get a feel for how a function works.

?kable

for special operators use quotes, e.g. ?"<-" Without any arguments,vignette()will list all vignettes for all installed packages;vignette(package=“package-name”)will list all available vignettes for package-name, andvignette(“vignette-name”)will open the specified vignette. And then there is always google. # Reproducible and Streamlined Analyses (Day 2) ## Exploring the sample data We already looked at the sample data in Terminal and saw that it was a.csv` file with 1705 lines and that it does have a header.

gapminder<- read.csv("data/gapminder_data.csv", header = TRUE)

View data in another tab with View()

gapminder<- read.csv("data/gapminder_data.csv", header = TRUE)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

4.1 Tidyr

Lesson Materials: http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html

CheatSheet: https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf

be aware: Gather/Spread has been renamed pivot_longer / pivot_wider

Extra information:

vignette("pivot")
## starting httpd help server ... done

Long v. Wide

SO what is this all about?

Long and wide dataframe layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.

Researchers often want to reshape their dataframes from ‘wide’ to ‘longer’ layouts, or vice-versa. The ‘long’ layout or format is where: * each column is a variable * each row is an observation In the purely ‘long’ (or ‘longest’) format, you usually have 1 column for the observed variable and the other columns are ID variables.

For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R’s functions have been designed assuming you have ‘longer’ formatted data. (Especially ggplot!)

library(tidyr)

Question:

Is gapminder a purely long, purely wide, or some intermediate format?

Using pivot:

Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) will never be so well organized. Here let’s start with the wide formatted version of the gapminder dataset.

Challenge:

Download this dataset into your data directory: https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv

Then read the data into R and name the dataframe gap_wide:

or

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
#Added 'as.data.frame' to avoid problems associated with the data.table format it comes in

gap_wide <- fread("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv", header=TRUE)

That gives us this:

and we want to practice pivoting longer with pivot_longer():

gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
    names_to = "obstype_year", values_to = "obs_values"
  )
str(gap_long)
## tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
##  $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
##  $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
##  $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
##  $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...

can also use “-” syntax!

using separate

gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
gap_long$year <- as.integer(gap_long$year)

Challenge: Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions

gap_long %>% 
  group_by(continent, obs_type) %>% 
  summarize(mean_obs = mean(obs_values))
## # A tibble: 15 x 3
## # Groups:   continent [5]
##    continent obs_type    mean_obs
##    <chr>     <chr>          <dbl>
##  1 Africa    gdpPercap     2194. 
##  2 Africa    lifeExp         48.9
##  3 Africa    pop        9916003. 
##  4 Americas  gdpPercap     7136. 
##  5 Americas  lifeExp         64.7
##  6 Americas  pop       24504795. 
##  7 Asia      gdpPercap     7902. 
##  8 Asia      lifeExp         60.1
##  9 Asia      pop       77038722. 
## 10 Europe    gdpPercap    14469. 
## 11 Europe    lifeExp         71.9
## 12 Europe    pop       17169765. 
## 13 Oceania   gdpPercap    18622. 
## 14 Oceania   lifeExp         74.3
## 15 Oceania   pop        8874672.

Going in the other direction … (time dependent)

5 Ggplot!

Cheatsheet: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Cook Book: http://www.cookbook-r.com/Graphs/

Plotting can be for exploration or for sharing. Often what a plot looks like will depend on why you are making it (i.e. exploration v. sharing).

Explore:

Share:

Question: What would be different in the you you make a plot based on whether it is for exploring or sharing?

Since this minicourse is about reproducible analyses for sharing and publishing - we will work on making sharable plots.

library(ggplot2) # we already loaded, but just incase

GG = the grammar of graphics

ggplots are built in layers (same concept as illustrator)

1st layer:

ggplot()

the “canvas” so to speak

2nd layer:

ggplot(gapminder, aes(x=year, y = gdpPercap)) 

gapminder %>% 
  ggplot(aes(x=year, y=gdpPercap))

The aesthetics (based on the data)

3rd layer:

ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point()

Challenge: Modify the example so that the figure shows how life expectancy has changed over time

#Change the x=
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_point()

#make points more transparent
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_point(alpha = 0.5)

#de-agregate the points 
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_jitter(alpha = 0.5)

#de-agregate the points with less spread
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_jitter(alpha = 0.5, width = 1)

In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column. What trends do you see in the data? Are they what you expected?

ggplot(gapminder, aes(x=year, y = lifeExp, color = continent)) +
  geom_jitter(alpha = 0.5, width = 1)

#Two ways to do it
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_jitter(alpha = 0.5, width = 1, aes(color = continent))

5.0.2 geoms

bar, box, violin, text, line, smooth (lm or loess) …. order matters

geom_hline(yintercept, linetype, color, size)

alpha() jitter() scale_x_log10()

ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_boxplot(aes(continent))

ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_violin(aes(continent))

ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_violin(aes(continent)) +
  geom_point(aes(x=year, y = lifeExp))

5.0.3 Labels

labs(
    x = "Year",              # x axis title
    y = "Life expectancy",   # y axis title
    title = "Figure 1",      # main title of figure
    color = "Continent"      # title of legend
  )

More on Titles, subtitles, and captions: https://www.datanovia.com/en/blog/ggplot-title-subtitle-and-caption/

5.0.4 Themes:

ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point() + theme_bw()

ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point() + theme_classic()

ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point() + theme_test()

ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point() + theme_void()

https://ggplot2.tidyverse.org/reference/ggtheme.html

https://jrnold.github.io/ggthemes/reference/index.html

5.0.5 facets

multipanel plots

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  facet_wrap(~continent) +
  geom_point() 

#we were masking some of Oceania so we allow various Y scales
gapminder %>% 
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  facet_wrap(~continent, scales = "free_y") +
  geom_point() 

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  facet_wrap(~continent, scales = "free_y") +
  geom_point() +
  scale_y_continuous(n.breaks = 3)

5.0.6 Saving plots

ggsave

Going to export whatever plot you made last

ggsave("figures/faceted_lifeExpByYear.png", height = 5, width = 7)

5.0.7 Putting plots together

#install.packages("patchwork")
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.0.2

https://gotellilab.github.io/GotelliLabMeetingHacks/NickGotelli/ggplotPatchwork.html

6 Knit and Host!